An empirical evaluation of easily implemented, nonparametric methods for generating synthetic datasets
Authors
Abstract
When intense redaction is needed to protect data subjects’ confidentiality, statistical agencies can release synthetic data, in which identifying or sensitive values are replaced with draws from statistical models estimated from the confidential data. Specifying accurate synthesis models can be a difficult and labor-intensive task with standard parametric approaches. We describe and empirically evaluate four easy-to-implement, nonparametric synthesizers based on machine learning algorithms (classification and regression trees, bagging, random forests, and support vector machines) on their potential to preserve analytical validity and reduce disclosure risks. The results suggest that synthesizers based on regression trees can provide high utility with low disclosure risks.
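The abstract does not spell out how a tree-based synthesizer works, so the following is only a minimal sketch under stated assumptions, not the authors' implementation. It assumes one common reading of the idea: grow a regression tree for the sensitive variable on a predictor, then replace each record's sensitive value with a random draw from the confidential values that fall in the same leaf. The names `grow_tree`, `leaf_values`, `synthesize`, and the `min_leaf` parameter are hypothetical, and the sketch handles a single numeric predictor only.

```python
import random
from statistics import mean


def sse(values):
    """Sum of squared deviations from the mean, used as node impurity."""
    if not values:
        return 0.0
    m = mean(values)
    return sum((v - m) ** 2 for v in values)


def grow_tree(rows, min_leaf=5):
    """Grow a regression tree on (x, y) pairs with one numeric predictor x.

    Returns a nested dict. Each leaf stores the observed y values that
    reached it, so that synthetic values can be drawn from them later.
    """
    ys = [y for _, y in rows]
    best = None
    xs = sorted({x for x, _ in rows})
    # Candidate cut points are midpoints between adjacent distinct x values.
    for lo, hi in zip(xs, xs[1:]):
        cut = (lo + hi) / 2
        left = [r for r in rows if r[0] <= cut]
        right = [r for r in rows if r[0] > cut]
        if len(left) < min_leaf or len(right) < min_leaf:
            continue
        score = sse([y for _, y in left]) + sse([y for _, y in right])
        if best is None or score < best[0]:
            best = (score, cut, left, right)
    # Stop when no admissible split reduces the impurity.
    if best is None or best[0] >= sse(ys):
        return {"leaf": ys}
    _, cut, left, right = best
    return {
        "cut": cut,
        "left": grow_tree(left, min_leaf),
        "right": grow_tree(right, min_leaf),
    }


def leaf_values(tree, x):
    """Walk the tree and return the y values stored in the leaf for x."""
    while "leaf" not in tree:
        tree = tree["left"] if x <= tree["cut"] else tree["right"]
    return tree["leaf"]


def synthesize(rows, min_leaf=5, seed=0):
    """Keep each x and replace y with a draw from the matching leaf (assumed scheme)."""
    rng = random.Random(seed)
    tree = grow_tree(rows, min_leaf)
    return [(x, rng.choice(leaf_values(tree, x))) for x, _ in rows]
```

In this sketch a larger `min_leaf` means each synthetic value is drawn from a bigger pool of confidential values, which is one plausible lever for trading utility against disclosure risk. A practical synthesizer would also handle several predictors and categorical variables, which this sketch omits.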
Similar resources
An Empirical Comparison of Distance Measures for Multivariate Time Series Clustering
Multivariate time series (MTS) data are ubiquitous in science and daily life, and how to measure their similarity is a core part of MTS analyzing process. Many of the research efforts in this context have focused on proposing novel similarity measures for the underlying data. However, with the countless techniques to estimate similarity between MTS, this field suffers from a lack of comparative...
Synthetic Datasets for the German IAB Establishment Panel
Disseminating microdata to the public that provide a high level of data utility while at the same time guaranteeing the confidentiality of the survey respondent is a difficult task. Generating multiply imputed synthetic datasets is an innovative statistical disclosure limitation technique with the potential of enabling the data disseminating agency to achieve this twofold goal. So far, the appr...
On the Generation of Spatiotemporal Datasets
An efficient benchmarking environment for spatiotemporal access methods should at least include modules for: generating synthetic datasets, storing datasets (real datasets included), collecting and running access structures, and visualizing experimental results. Focusing on the dataset repository module, a collection of synthetic data that would simulate a variety of real life scenarios is requ...
Nonparametric Estimation of Multi-View Latent Variable Models
Spectral methods have greatly advanced the estimation of latent variable models, generating a sequence of novel and efficient algorithms with strong theoretical guarantees. However, current spectral algorithms are largely restricted to mixtures of discrete or Gaussian distributions. In this paper, we propose a kernel method for learning multi-view latent variable models, allowing each mixture c...
Computational aspects of nonparametric smoothing with illustrations from the sm library
Smoothing techniques such as density estimation and nonparametric regression are widely used in applied work and the basic estimation procedures can be implemented relatively easily in standard statistical computing environments. However, computationally efficient procedures quickly become necessary with large datasets, many evaluation points or more than one covariate. Further computational issu...
Journal title:
- Computational Statistics & Data Analysis
Volume 55, Issue
Pages -
Publication date: 2011